feat: Add exponential backoff and log deduplication to Spotlight #5025

mattico wants to merge 4 commits into getsentry:main from
Conversation
…sport

When the Spotlight server is unreachable, the SDK now logs the error only once and implements exponential backoff (1s initial, 60s max) before retrying, per the Spotlight error handling spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SDK docs suggest this retry logic:

```python
def send(self, envelope):
    try:
        # Attempt to send
        self._send_envelope(envelope)
        # Reset retry delay on success
        self.retry_delay = 1.0
        self.error_logged = False
    except ConnectionError as e:
        # Log the failure only once per outage
        if not self.error_logged:
            logger.error(f"Spotlight server unreachable at {self.url}: {e}")
            self.error_logged = True
        # Exponential backoff: wait, then double the delay up to the cap
        time.sleep(self.retry_delay)
        self.retry_delay = min(self.retry_delay * 2, self.max_retry_delay)
        # Retry once, then give up for this envelope
        try:
            self._send_envelope(envelope)
            self.retry_delay = 1.0
        except ConnectionError:
            # Silently drop the envelope after the retry fails
            pass
```

While the actual sentry-python SDK implements this retry logic:

```python
def capture_envelope(self, envelope):
    # type: (Envelope) -> None
    # Check if we're in a backoff period - skip sending to avoid blocking
    if self._last_error_time > 0:
        time_since_error = time.time() - self._last_error_time
        if time_since_error < self._retry_delay:
            # Still in backoff period, skip this envelope
            return
    body = io.BytesIO()
    envelope.serialize_into(body)
    try:
        req = self.http.request(
            url=self.url,
            body=body.getvalue(),
            method="POST",
            headers={
                "Content-Type": "application/x-sentry-envelope",
            },
        )
        req.close()
        # Success - reset backoff state
        self._retry_delay = self.INITIAL_RETRY_DELAY
        self._last_error_time = 0.0
    except Exception as e:
        self._last_error_time = time.time()
        # Increase backoff delay exponentially first, so the logged value matches the actual wait
        self._retry_delay = min(self._retry_delay * 2, self.MAX_RETRY_DELAY)
        # Log error once per backoff cycle (sends are skipped during backoff, so only one failure per cycle)
        sentry_logger.warning(
            "Failed to send envelope to Spotlight at %s: %s. "
            "Will retry after %.1f seconds.",
            self.url,
            e,
            self._retry_delay,
        )
```

There are some differences:
I suppose both the example code and the SDK fit the stated requirements, just differently. I implemented behavior closer to the Python SDK's than to the docs suggestion, which I think is better anyhow. The one difference I can see between my implementation and the Python SDK is that the Python SDK logs one warning per backoff cycle, while mine logs only the first error (resetting after a success).
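The non-blocking, skip-during-backoff behavior discussed above can be condensed into a standalone helper. This is a hypothetical sketch (the class name and the injectable `clock` are mine, not from either SDK) of the pattern: a failure opens a cooldown window during which sends are skipped rather than blocked on, and the delay doubles per cycle up to a cap.

```python
import time


class SpotlightBackoff:
    """Hypothetical skip-during-backoff tracker, modeled on the
    sentry-python behavior shown above (not actual SDK code)."""

    def __init__(self, initial=1.0, maximum=60.0, clock=time.time):
        self.initial = initial
        self.maximum = maximum
        self.clock = clock          # injectable for testing
        self.delay = initial
        self.last_error_time = 0.0

    def should_send(self):
        # Inside the cooldown window, drop envelopes instead of blocking.
        if self.last_error_time > 0:
            return (self.clock() - self.last_error_time) >= self.delay
        return True

    def record_success(self):
        # A successful send resets the backoff state entirely.
        self.delay = self.initial
        self.last_error_time = 0.0

    def record_failure(self):
        # Double the delay first, so callers can log the actual wait.
        self.last_error_time = self.clock()
        self.delay = min(self.delay * 2, self.maximum)
```

Note that, like the sentry-python code above, doubling before the wait means the first cooldown is 2x the initial delay; whether that matches "1s initial" is exactly the kind of spec ambiguity the two snippets resolve differently.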
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #5025      +/-   ##
==========================================
+ Coverage   73.91%   74.10%   +0.19%
==========================================
  Files         497      499       +2
  Lines       17974    18079     +105
  Branches     3517     3514       -3
==========================================
+ Hits        13285    13397     +112
+ Misses       3833     3826       -7
  Partials      856      856
```
```csharp
using var response = await _httpClient.SendAsync(request, cancellationToken).ConfigureAwait(false);
await HandleResponseAsync(response, processedEnvelope, cancellationToken).ConfigureAwait(false);

_backoff.RecordSuccess();
```
Bug: The backoff mechanism in SpotlightHttpTransport is incorrectly reset on HTTP error responses (4xx/5xx) because _backoff.RecordSuccess() is called regardless of the response status.
Severity: HIGH
Suggested Fix
The HandleResponseAsync method should return a status indicating success. SpotlightHttpTransport should then check this status. Call _backoff.RecordSuccess() only if the response was successful (e.g., 200 OK). Otherwise, call _backoff.RecordFailure() to correctly trigger the exponential backoff logic for HTTP errors.
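The suggested fix can be sketched in miniature. The real code is C#; this is a hypothetical Python illustration (the class, `record_success`/`record_failure` names, and fake HTTP shape are mine) of the pattern: branch the backoff update on the HTTP status instead of resetting unconditionally after the response is handled.

```python
class SpotlightTransport:
    """Hypothetical sketch of the suggested fix: only a 2xx response
    resets the backoff; connection errors AND HTTP-level errors
    (4xx/5xx) both record a failure."""

    def __init__(self, http, url, backoff):
        self.http = http        # object with .request(...) -> response with .status
        self.url = url
        self.backoff = backoff  # object with record_success()/record_failure()

    def send_envelope(self, payload):
        try:
            response = self.http.request("POST", self.url, body=payload)
        except ConnectionError:
            self.backoff.record_failure()
            return False
        # Treat HTTP-level errors as failures too, so a struggling
        # server still triggers exponential backoff.
        if 200 <= response.status < 300:
            self.backoff.record_success()
            return True
        self.backoff.record_failure()
        return False
```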
Prompt for AI Agent

```
Review the code at the location below. A potential bug has been identified by an AI agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: src/Sentry/Http/SpotlightHttpTransport.cs#L55

Potential issue: In `SpotlightHttpTransport.SendEnvelopeAsync`, the
`_backoff.RecordSuccess()` method is called after `HandleResponseAsync`. However,
`HandleResponseAsync` does not throw an exception for non-200 HTTP status codes. If the
Spotlight server returns an HTTP error (e.g., 500, 503), the backoff state is
incorrectly reset, and the retry delay is not increased. This defeats the exponential
backoff mechanism for HTTP-level failures, causing the SDK to spam a struggling server
instead of waiting. The intended error logging for this failure path is also skipped.
```
To my knowledge, the Python SDK does not consider an HTTP error response a failure for backoff purposes (urllib3 doesn't raise for those), and the SDK docs don't either (they only catch ConnectionError).
I haven't thought deeply about it otherwise.
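That distinction can be made concrete with a small self-contained sketch (fake objects, no real network; the names here are illustrative, not from any SDK). A urllib3-style client returns a response object for a 500 rather than raising, so an `except`-based backoff path never fires for HTTP errors — only for transport-level failures.

```python
class FakeResponse:
    def __init__(self, status):
        self.status = status
    def close(self):
        pass


class Urllib3StyleHttp:
    """Fake modeling urllib3's behavior: HTTP error statuses come back
    as a response object with .status; only transport-level failures
    (e.g. connection refused) raise."""

    def __init__(self, status=500, raise_connection_error=False):
        self.status = status
        self.raise_connection_error = raise_connection_error

    def request(self, method, url, body=None):
        if self.raise_connection_error:
            raise ConnectionError("connection refused")
        return FakeResponse(self.status)


def send_once(http, url, payload):
    """Returns 'sent', or 'backoff' when the except branch fires."""
    try:
        resp = http.request("POST", url, body=payload)
        resp.close()
        return "sent"          # reached even for a 500 response
    except ConnectionError:
        return "backoff"       # only transport-level failures land here
```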
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Thanks heaps for the PR @mattico! It looks pretty good to me. There are some issues with a couple of our CI checks when running against forks of the repo, which I'm asking for advice on. Once we've worked through those though, I think we should be able to get this reviewed and merged in. Apologies for the delay...

Closes #3481.
Implements the error handling behavior from the SDK docs Spotlight spec. When the Spotlight server is unreachable, the SDK now logs the error only once (resetting after success) and implements exponential backoff (1s initial, 60s max) before retrying.
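The stated parameters (1s initial, 60s max, doubling) pin down the delay schedule. A quick sketch, assuming the delay is used before being doubled (as in the docs-suggested snippet; the actual SDK doubles before the wait, so its first cooldown is 2s):

```python
def backoff_schedule(failures, initial=1.0, maximum=60.0):
    """Delay before each of the first `failures` retries,
    doubling from `initial` and capped at `maximum`."""
    delays = []
    delay = initial
    for _ in range(failures):
        delays.append(delay)
        delay = min(delay * 2, maximum)
    return delays
```

With the PR's parameters the sequence reaches the 60s cap on the seventh consecutive failure and stays there.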